Weighted Bandits or: How Bandits Learn Distorted Values That Are Not Expected
Authors
Abstract
Motivated by models of human decision making proposed to explain commonly observed deviations from conventional expected-value preferences, we formulate two stochastic multi-armed bandit problems with distorted probabilities on the cost distributions: the classic K-armed bandit and the linearly parameterized bandit. In both settings, we propose algorithms that are inspired by Upper Confidence Bound (UCB) algorithms, incorporate cost distortions, and exhibit sublinear regret assuming Hölder-continuous weight distortion functions. For the K-armed setting, we show that the algorithm, called W-UCB, achieves problem-dependent regret $O\!\left(LM\log n/\Delta^{\frac{2}{\alpha}-1}\right)$, where $n$ is the number of plays, $\Delta$ is the gap in distorted expected value between the best and next-best arm, $L$ and $\alpha$ are the Hölder constants for the distortion function, and $M$ is an upper bound on costs, and a problem-independent regret bound of $O\!\left((KL^2M^2)^{\alpha/2}\,n^{(2-\alpha)/2}\right)$. We also present a matching lower bound on the regret, showing that the regret of W-UCB is essentially unimprovable over the class of Hölder-continuous weight distortions. For the linearly parameterized setting, we develop a new algorithm, a variant of the Optimism in the Face of Uncertainty Linear bandit (OFUL) algorithm of Abbasi-Yadkori et al. [2011] called WOFUL (Weight-distorted OFUL), and show that it has regret $O(d\sqrt{n}\,\mathrm{polylog}(n))$ with high probability for sub-Gaussian cost distributions. Finally, numerical examples demonstrate the advantages of using distortion-aware learning algorithms.
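To make the K-armed construction concrete, here is a minimal Python sketch (not the authors' code) of a W-UCB-style index for cost minimization. The distorted value of an arm is the Choquet-type integral $\int_0^M w(\bar{F}(t))\,dt$ estimated from the empirical tail distribution, and the exploration width of order $LM(\log t/T_i)^{\alpha/2}$ is an assumption motivated by the DKW inequality combined with Hölder continuity of $w$; the sampler interface and all constants are illustrative and may differ from the paper's.

```python
import numpy as np

def distorted_cost(samples, w):
    """Empirical distorted (Choquet) expectation of non-negative costs.

    Estimates int_0^M w(P(X > t)) dt by sorting the samples and weighting
    each order statistic by the increment of w across the empirical tail
    probabilities. Assumes w(0) = 0, w(1) = 1, and w vectorized over arrays.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    m = len(x)
    tail = np.arange(m, 0, -1) / m          # hat-P(X >= x_(k)), k = 1..m
    return float(np.dot(x, w(tail) - w(tail - 1.0 / m)))

def w_ucb(arms, w, n, M, L, alpha):
    """Index-policy sketch: play the arm with the lowest optimistic
    (distorted) cost estimate. Width L*M*(log t / T_i)**(alpha/2) is an
    assumed confidence radius; the paper's exact constants may differ.
    """
    costs = [[arm()] for arm in arms]       # initialize: pull each arm once
    for t in range(len(arms) + 1, n + 1):
        lcb = [distorted_cost(c, w) - L * M * (np.log(t) / len(c)) ** (alpha / 2)
               for c in costs]
        i = int(np.argmin(lcb))             # optimism for minimization
        costs[i].append(arms[i]())
    return costs

# Hypothetical usage: two cost arms in [0, 1] and distortion w(p) = p**0.7,
# which is Holder continuous with L = 1 and alpha = 0.7.
rng = np.random.default_rng(0)
arms = [lambda: rng.uniform(0.2, 0.8), lambda: rng.beta(2, 5)]
history = w_ucb(arms, lambda p: np.clip(p, 0, 1) ** 0.7,
                n=2000, M=1.0, L=1.0, alpha=0.7)
print([len(c) for c in history])            # pulls per arm
```

Note the design point the abstract alludes to: because only the weight function (not the mean) enters the value, the estimator works on the whole empirical distribution, and the Hölder exponent $\alpha$ shows up directly in the exploration width and hence in the $n^{(2-\alpha)/2}$ regret scaling.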
Similar resources
Modal Bandits
Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.
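The snippet above does not spell out an algorithm; as a purely hypothetical illustration of the quantity being optimized, the modal value of an arm can be estimated from samples with a histogram:

```python
import numpy as np

def histogram_mode(samples, bins=20, low=0.0, high=1.0):
    # Hypothetical mode estimate: midpoint of the most populated bin,
    # assuming rewards lie in [low, high]. A bandit algorithm built on
    # this estimate would still need an exploration bonus.
    counts, edges = np.histogram(samples, bins=bins, range=(low, high))
    k = int(np.argmax(counts))
    return 0.5 * (edges[k] + edges[k + 1])
```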
A Generalized Gittins Index for a Class of Multiarmed Bandits with General Resource Requirements
We generalise classical multi-armed and restless bandits to allow for the distribution of a (fixed amount of a) divisible resource among the constituent bandits at each decision point. Bandit activation consumes amounts of the available resource which may vary by bandit and state. Any collection of bandits may be activated at any decision epoch provided they do not consume more resource than is...
Lipschitz Bandits: Regret Lower Bound and Optimal Algorithms
We consider stochastic multi-armed bandit problems where the expected reward is a Lipschitz function of the arm, and where the set of arms is either discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic problem specific lower bounds for the regret satisfied by any algorithm, and propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz structure of t...
An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits
We present an algorithm that achieves almost optimal pseudo-regret bounds against adversarial and stochastic bandits. Against adversarial bandits the pseudo-regret is $O(K\sqrt{n\log n})$ and against stochastic bandits the pseudo-regret is $O\big(\sum_i (\log n)/\Delta_i\big)$. We also show that no algorithm with $O(\log n)$ pseudo-regret against stochastic bandits can achieve $\tilde{O}(\sqrt{n})$ expected regret against adaptive...
Asymptotically optimal priority policies for indexable and non-indexable restless bandits
We study the asymptotic optimal control of multi-class restless bandits. A restless bandit is a controllable stochastic process whose state evolution depends on whether or not the bandit is made active. Since finding the optimal control is typically intractable, we propose a class of priority policies that are proved to be asymptotically optimal under a global attractor property an...